library(ggplot2)
library(dplyr)
library(statsr)
load("movies.Rdata")
The data set contains information from Rotten Tomatoes and IMDb on 651 randomly sampled movies produced and released before 2016, described by 32 variables. Because random sampling was used, relationships found and models built from this sample can be generalized, with caution, to other movies. To avoid extrapolation, anyone generalizing the findings should check that the movies of interest fall within the scope of the movies in the sample. Because the data come from an observational study rather than an experiment, no causal conclusions should be drawn about the variables.
The analysis in this report addresses one question: which attributes are associated with higher movie popularity? To answer it, a multiple linear regression model is built to identify the significant predictors of movie popularity. Such information could be useful to movie producers and promotion companies.
To gather the information needed for the task, we first examine the popularity of the movies in the sample as measured by two numerical variables: audience_score and imdb_rating. According to the IMDb website, the IMDb rating of a movie is aggregated from individual votes cast by registered users on a scale from 1 to 10. According to the Rotten Tomatoes website, the audience score is calculated from ratings submitted to Rotten Tomatoes by users: it is the percentage of users who have rated the movie or TV show positively. When at least 60% of users give a movie or TV show a star rating of 3.5 or higher, it receives an audience rating of Upright; below 60% it receives a Spilled status. In the data set, audience_score is a number between 1 and 100 representing that percentage.
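The Upright/Spilled rule described above can be sketched as a small function. This is an illustration only: `classify_audience` and its `threshold` argument are hypothetical names, and the 60% cutoff is taken from the description above, not from any Rotten Tomatoes code.

```r
# Hypothetical helper: map an audience score (a percentage) to the status
# described above. Assumes scores at or above the threshold are "Upright".
classify_audience <- function(score, threshold = 60) {
  ifelse(score >= threshold, "Upright", "Spilled")
}

example_status <- classify_audience(c(59, 60, 97))
print(example_status)
```

Because the status is a direct function of the score, a variable holding it carries no information beyond audience_score itself.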
#summary of imdb_rating of the sampled movies
summary(movies$imdb_rating)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.900 5.900 6.600 6.493 7.300 9.000
boxplot(movies$imdb_rating, horizontal=TRUE)
text(x = boxplot.stats(movies$imdb_rating)$stats, labels = boxplot.stats(movies$imdb_rating)$stats, y=1.25)
#count the movies rated below 4.28 (1st quartile - 1.5*sd); note that the boxplot itself flags outliers using 1.5*IQR
movies%>%filter(imdb_rating<4.28)%>%count()
## # A tibble: 1 x 1
## n
## <int>
## 1 28
As shown in the boxplot above, the IMDb ratings range from 1.9 to 9.0, and the middle 50% of the ratings lie between 5.9 and 7.3. The distribution is left skewed, with a median of 6.6 and a mean of 6.493. There are 28 movies rated below 4.28, which are treated here as low outliers.
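For comparison, the conventional boxplot rule places the fences 1.5 interquartile ranges beyond the quartiles. The sketch below works this out from the quartiles reported in the summary above. The helper `count_below_fence` is a hypothetical name, the toy vector is made up for illustration, and `quantile()` can differ slightly from the hinges that `boxplot.stats()` uses.

```r
# Quartiles of imdb_rating as reported by summary() above
q1 <- 5.9
q3 <- 7.3
iqr <- q3 - q1
lower_fence <- q1 - 1.5 * iqr   # about 3.8
upper_fence <- q3 + 1.5 * iqr   # about 9.4

# Hypothetical helper: count values below the lower 1.5*IQR fence
count_below_fence <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  fence <- q[[1]] - 1.5 * (q[[2]] - q[[1]])
  sum(x < fence, na.rm = TRUE)
}

# Made-up toy vector, used only to show the helper running
toy_ratings <- c(1, 10, 11, 12, 13, 14)
n_low <- count_below_fence(toy_ratings)
```

Applied to the data, `count_below_fence(movies$imdb_rating)` would use a cutoff of roughly 3.8, which is stricter than 4.28, so it cannot flag more movies than the 28 counted above.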
#summary of the audience_score of the sampled movies
summary(movies$audience_score)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 11.00 46.00 65.00 62.36 80.00 97.00
boxplot(movies$audience_score, horizontal=TRUE)
text(x=boxplot.stats(movies$audience_score)$stats, labels=boxplot.stats(movies$audience_score)$stats, y=1.25)
As shown in the boxplot above, audience_score for the sampled movies ranges from 11 to 97, with the middle 50% of the movies scoring between 46 and 80. The distribution is slightly left skewed, with a mean of 62.36 and a median of 65.
Although imdb_rating and audience_score are both measures of a movie’s popularity, what we don’t know is whether people visiting the two websites rate movies differently. To find out the relationship between the two popularity measurements, a linear regression model with the audience_score as the response variable, and the imdb_rating as the explanatory variable is fit.
ggplot(data=movies, aes(x=imdb_rating, y=audience_score))+
    geom_jitter()+
    stat_smooth(method="lm", se=FALSE)
## `geom_smooth()` using formula 'y ~ x'
Judging from the graph above, there is a strong, positive, and possibly linear relationship between the two variables. The histogram of the residuals below shows a distribution that is centered at 0 but not exactly normal. The residuals vs. imdb_rating plot shows the residuals scattered randomly around 0; however, the residuals vs. fitted values plot shows that the variability is not exactly constant. It is therefore reasonable to question the reliability of a simple linear model for predicting audience_score from imdb_rating.
imdb_tomato<-lm(movies$audience_score~movies$imdb_rating)
summary(imdb_tomato)
##
## Call:
## lm(formula = movies$audience_score ~ movies$imdb_rating)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26.800 -6.567 0.649 5.689 52.896
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -42.3284 2.4183 -17.50 <2e-16 ***
## movies$imdb_rating 16.1234 0.3674 43.89 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.16 on 649 degrees of freedom
## Multiple R-squared: 0.748, Adjusted R-squared: 0.7476
## F-statistic: 1926 on 1 and 649 DF, p-value: < 2.2e-16
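To make the fitted line concrete, the coefficients in the output above can be plugged in by hand. This is a small sketch, and the function name `predict_audience_score` is hypothetical; with a model fitted through a `data` argument, `predict()` would normally do this.

```r
# Coefficients copied from the summary output above
intercept <- -42.3284
slope <- 16.1234

# Hypothetical helper: predicted audience_score for a given imdb_rating
predict_audience_score <- function(imdb_rating) {
  intercept + slope * imdb_rating
}

pred_7 <- predict_audience_score(7)   # about 70.5
gain_per_point <- predict_audience_score(8) - predict_audience_score(7)
```

The negative intercept has no practical meaning here, because no movie in the sample has an imdb_rating near 0 (the minimum is 1.9).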
#histogram of the residuals
hist(imdb_tomato$residuals)
#plot the residuals vs. imdb_rating (no rows were dropped in this fit, so the lengths match)
plot(movies$imdb_rating, imdb_tomato$residuals)
#plot the residuals vs. fitted
plot(imdb_tomato$fitted.values, imdb_tomato$residuals)
For the reason stated above, and because the attributes that account for a movie's popularity might differ between the two websites, two multiple linear regression models will be built to predict popularity, one with audience_score and one with imdb_rating as the response variable.
To fit the models, backward elimination is used: start with a full model and drop one predictor at a time until a parsimonious model is reached. The p-value approach is used because the task is to understand which variables are statistically significant predictors of a movie's popularity.
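The procedure just described can also be automated. The sketch below is one possible variant, not the steps used in this report: it relies on `drop1()` with an F-test, which tests each term as a whole (so a factor is judged by all its levels at once, not level by level). `backward_eliminate` and `alpha` are hypothetical names, and the built-in `mtcars` data stands in for the movies data.

```r
# Sketch of p-value-based backward elimination on whole terms.
# `backward_eliminate` and `alpha` are hypothetical names, not from any package.
backward_eliminate <- function(model, alpha = 0.05) {
  repeat {
    # Stop if only the intercept is left
    if (length(attr(terms(model), "term.labels")) == 0) break
    tab <- drop1(model, test = "F")   # one F-test per term in the model
    pvals <- tab[["Pr(>F)"]][-1]       # the first row is "<none>"
    candidates <- rownames(tab)[-1]
    if (max(pvals, na.rm = TRUE) < alpha) break
    worst <- candidates[which.max(pvals)]
    model <- update(model, as.formula(paste(". ~ . -", worst)))
  }
  model
}

# Built-in mtcars data, used only as a stand-in
start_fit <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
final_fit <- backward_eliminate(start_fit)
formula(final_fit)
```

Doing the steps by hand, as in the rest of this report, keeps every intermediate summary visible, which is useful when a factor has some significant and some non-significant levels.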
First we need to decide which variables in the data will be excluded from the initial models.
For both models, the following variables are excluded.
1. title: the title is a standalone attribute unique to each movie.
2. the theater and DVD release years: these are excluded while the release month and day are kept, because the month and day of release might affect how many people have watched and rated a movie.
3. critics_rating: critics_score is kept instead, because the rating is based on the score.
4. audience_rating: it is derived from audience_score and is therefore redundant.
5. director and actors: although movies by certain directors or with a certain cast might appeal more to audiences and therefore be more popular, these are standalone features that vary from movie to movie.
6. imdb_url and rt_url: these are links to web pages unique to each movie.
7. studio: there are 211 studios in total, too many to take into account if the goal is to build the simplest model with the highest predictive power.
For the first model, which uses audience_score as the response variable, imdb_num_votes is also excluded because it is specific to IMDb ratings.
After the selection above, the variables used to build the first multiple linear regression model are: title_type, genre, runtime, mpaa_rating, thtr_rel_month, thtr_rel_day, dvd_rel_month, dvd_rel_day, critics_score, best_pic_nom, best_pic_win, best_actor_win, best_actress_win, best_dir_win, and top200_box, a total of 15 variables, in addition to the response variable audience_score.
The variables used to build the second model are: title_type, genre, runtime, mpaa_rating, thtr_rel_month, thtr_rel_day, dvd_rel_month, dvd_rel_day, critics_score, imdb_num_votes, best_pic_nom, best_pic_win, best_actor_win, best_actress_win, best_dir_win, and top200_box, a total of 16 variables, in addition to the response variable imdb_rating.
Backward elimination for model 1 starts from the full model.
mlr_full<-lm(audience_score ~ title_type + genre + runtime +
    mpaa_rating + thtr_rel_month + thtr_rel_day +
    dvd_rel_month + dvd_rel_day + critics_score +
    best_pic_nom + best_pic_win + best_actor_win +
    best_actress_win + best_dir_win + top200_box, data=movies)
summary(mlr_full)
##
## Call:
## lm(formula = audience_score ~ title_type + genre + runtime +
## mpaa_rating + thtr_rel_month + thtr_rel_day + dvd_rel_month +
## dvd_rel_day + critics_score + best_pic_nom + best_pic_win +
## best_actor_win + best_actress_win + best_dir_win + top200_box,
## data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -36.122 -9.155 0.706 9.221 40.555
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26.34639 7.51420 3.506 0.000488 ***
## title_typeFeature Film 1.36685 5.23541 0.261 0.794120
## title_typeTV Movie -5.33826 8.21397 -0.650 0.516000
## genreAnimation 5.95830 5.52499 1.078 0.281268
## genreArt House & International 7.64319 4.39432 1.739 0.082479 .
## genreComedy 0.34858 2.37884 0.147 0.883548
## genreDocumentary 11.97188 5.59156 2.141 0.032663 *
## genreDrama 2.37370 2.08755 1.137 0.255952
## genreHorror -8.40700 3.50715 -2.397 0.016824 *
## genreMusical & Performing Arts 11.27426 4.80226 2.348 0.019207 *
## genreMystery & Suspense -3.83581 2.65523 -1.445 0.149075
## genreOther 1.96786 4.02669 0.489 0.625226
## genreScience Fiction & Fantasy -5.54155 5.28200 -1.049 0.294528
## runtime 0.07344 0.03425 2.144 0.032387 *
## mpaa_ratingNC-17 -12.33077 10.68486 -1.154 0.248934
## mpaa_ratingPG -1.98019 3.99085 -0.496 0.619944
## mpaa_ratingPG-13 -3.01922 4.11198 -0.734 0.463078
## mpaa_ratingR -1.13732 3.97171 -0.286 0.774703
## mpaa_ratingUnrated -2.54291 4.58578 -0.555 0.579424
## thtr_rel_month -0.10471 0.16631 -0.630 0.529173
## thtr_rel_day 0.02776 0.06372 0.436 0.663250
## dvd_rel_month 0.22521 0.16902 1.332 0.183221
## dvd_rel_day 0.06116 0.06309 0.969 0.332694
## critics_score 0.43237 0.02338 18.491 < 2e-16 ***
## best_pic_nomyes 10.10517 3.66865 2.754 0.006054 **
## best_pic_winyes -1.88171 6.40907 -0.294 0.769163
## best_actor_winyes -1.34568 1.67535 -0.803 0.422159
## best_actress_winyes -2.22196 1.84643 -1.203 0.229293
## best_dir_winyes 0.02706 2.42273 0.011 0.991092
## top200_boxyes 4.32292 3.81909 1.132 0.258109
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.96 on 612 degrees of freedom
## (9 observations deleted due to missingness)
## Multiple R-squared: 0.5413, Adjusted R-squared: 0.5195
## F-statistic: 24.9 on 29 and 612 DF, p-value: < 2.2e-16
#drop best_dir_win, which has the highest p-value
mlr_1<-lm(audience_score ~ title_type + genre + runtime +
    mpaa_rating + thtr_rel_month + thtr_rel_day +
    dvd_rel_month + dvd_rel_day + critics_score +
    best_pic_nom + best_pic_win + best_actor_win +
    best_actress_win + top200_box, data=movies)
summary(mlr_1)
##
## Call:
## lm(formula = audience_score ~ title_type + genre + runtime +
## mpaa_rating + thtr_rel_month + thtr_rel_day + dvd_rel_month +
## dvd_rel_day + critics_score + best_pic_nom + best_pic_win +
## best_actor_win + best_actress_win + top200_box, data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -36.099 -9.155 0.713 9.219 40.553
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26.33764 7.46720 3.527 0.000452 ***
## title_typeFeature Film 1.36905 5.22744 0.262 0.793489
## title_typeTV Movie -5.33662 8.20594 -0.650 0.515719
## genreAnimation 5.95771 5.52022 1.079 0.280900
## genreArt House & International 7.64160 4.38844 1.741 0.082132 .
## genreComedy 0.34818 2.37663 0.147 0.883573
## genreDocumentary 11.97207 5.58697 2.143 0.032518 *
## genreDrama 2.37273 2.08406 1.139 0.255351
## genreHorror -8.40691 3.50428 -2.399 0.016736 *
## genreMusical & Performing Arts 11.27365 4.79804 2.350 0.019108 *
## genreMystery & Suspense -3.83614 2.65290 -1.446 0.148683
## genreOther 1.96677 4.02221 0.489 0.625033
## genreScience Fiction & Fantasy -5.54007 5.27603 -1.050 0.294111
## runtime 0.07350 0.03381 2.174 0.030096 *
## mpaa_ratingNC-17 -12.33173 10.67579 -1.155 0.248495
## mpaa_ratingPG -1.97885 3.98579 -0.496 0.619736
## mpaa_ratingPG-13 -3.01825 4.10770 -0.735 0.462756
## mpaa_ratingR -1.13603 3.96679 -0.286 0.774680
## mpaa_ratingUnrated -2.54319 4.58197 -0.555 0.579067
## thtr_rel_month -0.10467 0.16613 -0.630 0.528897
## thtr_rel_day 0.02774 0.06365 0.436 0.663095
## dvd_rel_month 0.22512 0.16870 1.334 0.182555
## dvd_rel_day 0.06118 0.06300 0.971 0.331876
## critics_score 0.43240 0.02319 18.646 < 2e-16 ***
## best_pic_nomyes 10.10241 3.65733 2.762 0.005913 **
## best_pic_winyes -1.86155 6.14479 -0.303 0.762032
## best_actor_winyes -1.34491 1.67256 -0.804 0.421651
## best_actress_winyes -2.22192 1.84492 -1.204 0.228921
## top200_boxyes 4.32183 3.81472 1.133 0.257685
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.95 on 613 degrees of freedom
## (9 observations deleted due to missingness)
## Multiple R-squared: 0.5413, Adjusted R-squared: 0.5203
## F-statistic: 25.83 on 28 and 613 DF, p-value: < 2.2e-16
#Apart from genre, which has significant levels, the next highest p-value is title_typeFeature Film. As the other level of title_type also has a relatively high p-value, this variable is dropped next.
mlr_2<-lm(audience_score ~ genre + runtime + mpaa_rating +
    thtr_rel_month + thtr_rel_day + dvd_rel_month +
    dvd_rel_day + critics_score + best_pic_nom +
    best_pic_win + best_actor_win + best_actress_win +
    top200_box, data=movies)
summary(mlr_2)
##
## Call:
## lm(formula = audience_score ~ genre + runtime + mpaa_rating +
## thtr_rel_month + thtr_rel_day + dvd_rel_month + dvd_rel_day +
## critics_score + best_pic_nom + best_pic_win + best_actor_win +
## best_actress_win + top200_box, data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -36.198 -9.234 0.767 9.125 40.592
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 27.49955 5.32098 5.168 3.2e-07 ***
## genreAnimation 5.98757 5.51601 1.085 0.278130
## genreArt House & International 7.84324 4.38019 1.791 0.073847 .
## genreComedy 0.32484 2.37229 0.137 0.891130
## genreDocumentary 11.02854 3.22833 3.416 0.000677 ***
## genreDrama 2.30506 2.07996 1.108 0.268199
## genreHorror -8.30551 3.50051 -2.373 0.017967 *
## genreMusical & Performing Arts 10.89389 4.53069 2.404 0.016491 *
## genreMystery & Suspense -3.82048 2.65088 -1.441 0.150035
## genreOther 1.56143 3.99768 0.391 0.696239
## genreScience Fiction & Fantasy -5.53489 5.27226 -1.050 0.294216
## runtime 0.07476 0.03376 2.214 0.027193 *
## mpaa_ratingNC-17 -12.26393 10.66811 -1.150 0.250761
## mpaa_ratingPG -1.93296 3.98279 -0.485 0.627616
## mpaa_ratingPG-13 -2.96576 4.10456 -0.723 0.470230
## mpaa_ratingR -1.15084 3.96391 -0.290 0.771662
## mpaa_ratingUnrated -3.05610 4.54827 -0.672 0.501883
## thtr_rel_month -0.09937 0.16593 -0.599 0.549504
## thtr_rel_day 0.02599 0.06358 0.409 0.682831
## dvd_rel_month 0.22467 0.16844 1.334 0.182744
## dvd_rel_day 0.06480 0.06286 1.031 0.303026
## critics_score 0.43218 0.02295 18.832 < 2e-16 ***
## best_pic_nomyes 10.15562 3.65429 2.779 0.005618 **
## best_pic_winyes -1.89095 6.14007 -0.308 0.758210
## best_actor_winyes -1.29059 1.67057 -0.773 0.440086
## best_actress_winyes -2.28135 1.84270 -1.238 0.216171
## top200_boxyes 4.34395 3.81159 1.140 0.254867
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.94 on 615 degrees of freedom
## (9 observations deleted due to missingness)
## Multiple R-squared: 0.5404, Adjusted R-squared: 0.521
## F-statistic: 27.81 on 26 and 615 DF, p-value: < 2.2e-16
#Apart from genre, mpaa_ratingR now has the highest p-value. As the other levels of mpaa_rating also have high p-values, mpaa_rating is dropped next.
mlr_3<-lm(audience_score ~ genre + runtime + thtr_rel_month +
    thtr_rel_day + dvd_rel_month + dvd_rel_day +
    critics_score + best_pic_nom + best_pic_win +
    best_actor_win + best_actress_win + top200_box, data=movies)
summary(mlr_3)
##
## Call:
## lm(formula = audience_score ~ genre + runtime + thtr_rel_month +
## thtr_rel_day + dvd_rel_month + dvd_rel_day + critics_score +
## best_pic_nom + best_pic_win + best_actor_win + best_actress_win +
## top200_box, data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.100 -9.241 0.485 9.143 41.174
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26.29710 4.18119 6.289 6.02e-10 ***
## genreAnimation 6.98792 5.02253 1.391 0.164629
## genreArt House & International 7.66665 4.27817 1.792 0.073614 .
## genreComedy 0.05807 2.34506 0.025 0.980252
## genreDocumentary 10.21015 2.87958 3.546 0.000421 ***
## genreDrama 2.29966 2.03326 1.131 0.258484
## genreHorror -7.99693 3.42676 -2.334 0.019932 *
## genreMusical & Performing Arts 10.83507 4.49918 2.408 0.016321 *
## genreMystery & Suspense -3.49095 2.58906 -1.348 0.178039
## genreOther 1.51691 3.97289 0.382 0.702730
## genreScience Fiction & Fantasy -5.25006 5.26030 -0.998 0.318642
## runtime 0.06902 0.03303 2.090 0.037045 *
## thtr_rel_month -0.07372 0.16455 -0.448 0.654299
## thtr_rel_day 0.01831 0.06292 0.291 0.771090
## dvd_rel_month 0.22114 0.16769 1.319 0.187738
## dvd_rel_day 0.05922 0.06264 0.945 0.344804
## critics_score 0.43417 0.02232 19.453 < 2e-16 ***
## best_pic_nomyes 10.07441 3.63917 2.768 0.005803 **
## best_pic_winyes -1.52958 6.11686 -0.250 0.802624
## best_actor_winyes -1.42373 1.65894 -0.858 0.391106
## best_actress_winyes -2.31118 1.83720 -1.258 0.208870
## top200_boxyes 4.31283 3.76552 1.145 0.252507
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.92 on 620 degrees of freedom
## (9 observations deleted due to missingness)
## Multiple R-squared: 0.5381, Adjusted R-squared: 0.5224
## F-statistic: 34.39 on 21 and 620 DF, p-value: < 2.2e-16
#best_pic_win is dropped
mlr_4<-lm(audience_score ~ genre + runtime + thtr_rel_month +
    thtr_rel_day + dvd_rel_month + dvd_rel_day +
    critics_score + best_pic_nom + best_actor_win +
    best_actress_win + top200_box, data=movies)
summary(mlr_4)
##
## Call:
## lm(formula = audience_score ~ genre + runtime + thtr_rel_month +
## thtr_rel_day + dvd_rel_month + dvd_rel_day + critics_score +
## best_pic_nom + best_actor_win + best_actress_win + top200_box,
## data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.064 -9.202 0.505 9.157 41.150
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26.36948 4.16802 6.327 4.8e-10 ***
## genreAnimation 6.97936 5.01862 1.391 0.164817
## genreArt House & International 7.67119 4.27490 1.794 0.073224 .
## genreComedy 0.04120 2.34232 0.018 0.985972
## genreDocumentary 10.21328 2.87737 3.550 0.000415 ***
## genreDrama 2.30913 2.03138 1.137 0.256088
## genreHorror -8.00496 3.42403 -2.338 0.019710 *
## genreMusical & Performing Arts 10.85371 4.49517 2.415 0.016044 *
## genreMystery & Suspense -3.49500 2.58705 -1.351 0.177200
## genreOther 1.58172 3.96144 0.399 0.689825
## genreScience Fiction & Fantasy -5.24632 5.25631 -0.998 0.318619
## runtime 0.06817 0.03283 2.077 0.038245 *
## thtr_rel_month -0.07126 0.16413 -0.434 0.664306
## thtr_rel_day 0.01789 0.06285 0.285 0.776043
## dvd_rel_month 0.22430 0.16709 1.342 0.179951
## dvd_rel_day 0.05865 0.06255 0.938 0.348808
## critics_score 0.43396 0.02229 19.472 < 2e-16 ***
## best_pic_nomyes 9.67774 3.27277 2.957 0.003224 **
## best_actor_winyes -1.39308 1.65316 -0.843 0.399734
## best_actress_winyes -2.33439 1.83347 -1.273 0.203420
## top200_boxyes 4.27439 3.75954 1.137 0.255999
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.91 on 621 degrees of freedom
## (9 observations deleted due to missingness)
## Multiple R-squared: 0.538, Adjusted R-squared: 0.5231
## F-statistic: 36.16 on 20 and 621 DF, p-value: < 2.2e-16
#the theater release day (thtr_rel_day) is dropped
mlr_5<-lm(audience_score ~ genre + runtime + thtr_rel_month +
    dvd_rel_month + dvd_rel_day + critics_score +
    best_pic_nom + best_actor_win + best_actress_win +
    top200_box, data=movies)
summary(mlr_5)
##
## Call:
## lm(formula = audience_score ~ genre + runtime + thtr_rel_month +
## dvd_rel_month + dvd_rel_day + critics_score + best_pic_nom +
## best_actor_win + best_actress_win + top200_box, data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -36.955 -9.046 0.491 9.169 41.014
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26.59393 4.08970 6.503 1.62e-10 ***
## genreAnimation 6.97788 5.01491 1.391 0.164594
## genreArt House & International 7.74985 4.26280 1.818 0.069542 .
## genreComedy 0.02229 2.33965 0.010 0.992403
## genreDocumentary 10.20439 2.87508 3.549 0.000415 ***
## genreDrama 2.31447 2.02979 1.140 0.254621
## genreHorror -7.98608 3.42085 -2.335 0.019885 *
## genreMusical & Performing Arts 10.86408 4.49170 2.419 0.015862 *
## genreMystery & Suspense -3.48632 2.58496 -1.349 0.177926
## genreOther 1.52538 3.95356 0.386 0.699760
## genreScience Fiction & Fantasy -5.34717 5.24047 -1.020 0.307954
## runtime 0.06813 0.03280 2.077 0.038219 *
## thtr_rel_month -0.06617 0.16303 -0.406 0.684985
## dvd_rel_month 0.22388 0.16696 1.341 0.180433
## dvd_rel_day 0.05867 0.06250 0.939 0.348231
## critics_score 0.43401 0.02227 19.490 < 2e-16 ***
## best_pic_nomyes 9.66991 3.27023 2.957 0.003225 **
## best_actor_winyes -1.39113 1.65192 -0.842 0.400040
## best_actress_winyes -2.31205 1.83043 -1.263 0.207020
## top200_boxyes 4.27752 3.75674 1.139 0.255299
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.9 on 622 degrees of freedom
## (9 observations deleted due to missingness)
## Multiple R-squared: 0.538, Adjusted R-squared: 0.5238
## F-statistic: 38.12 on 19 and 622 DF, p-value: < 2.2e-16
#thtr_rel_month is dropped
mlr_6<-lm(audience_score ~ genre + runtime + dvd_rel_month +
    dvd_rel_day + critics_score + best_pic_nom +
    best_actor_win + best_actress_win + top200_box, data=movies)
summary(mlr_6)
##
## Call:
## lm(formula = audience_score ~ genre + runtime + dvd_rel_month +
## dvd_rel_day + critics_score + best_pic_nom + best_actor_win +
## best_actress_win + top200_box, data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -36.861 -9.220 0.724 9.171 40.830
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26.354511 4.044217 6.517 1.49e-10 ***
## genreAnimation 6.901592 5.008028 1.378 0.168665
## genreArt House & International 7.723126 4.259432 1.813 0.070285 .
## genreComedy -0.008771 2.336828 -0.004 0.997007
## genreDocumentary 10.193208 2.873018 3.548 0.000418 ***
## genreDrama 2.328610 2.028127 1.148 0.251344
## genreHorror -8.007652 3.418148 -2.343 0.019459 *
## genreMusical & Performing Arts 10.836455 4.488173 2.414 0.016046 *
## genreMystery & Suspense -3.436569 2.580320 -1.332 0.183400
## genreOther 1.614415 3.944826 0.409 0.682498
## genreScience Fiction & Fantasy -5.309149 5.236122 -1.014 0.311002
## runtime 0.065596 0.032182 2.038 0.041944 *
## dvd_rel_month 0.234706 0.164704 1.425 0.154653
## dvd_rel_day 0.058190 0.062449 0.932 0.351798
## critics_score 0.434065 0.022253 19.506 < 2e-16 ***
## best_pic_nomyes 9.516042 3.246005 2.932 0.003496 **
## best_actor_winyes -1.403475 1.650534 -0.850 0.395476
## best_actress_winyes -2.303588 1.829087 -1.259 0.208350
## top200_boxyes 4.191091 3.748190 1.118 0.263928
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.89 on 623 degrees of freedom
## (9 observations deleted due to missingness)
## Multiple R-squared: 0.5378, Adjusted R-squared: 0.5245
## F-statistic: 40.28 on 18 and 623 DF, p-value: < 2.2e-16
#best_actor_win is dropped
mlr_7<-lm(audience_score ~ genre + runtime + dvd_rel_month +
    dvd_rel_day + critics_score + best_pic_nom +
    best_actress_win + top200_box, data=movies)
summary(mlr_7)
##
## Call:
## lm(formula = audience_score ~ genre + runtime + dvd_rel_month +
## dvd_rel_day + critics_score + best_pic_nom + best_actress_win +
## top200_box, data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -36.589 -9.025 0.645 9.234 40.997
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26.630359 4.030290 6.608 8.38e-11 ***
## genreAnimation 6.839265 5.006381 1.366 0.172397
## genreArt House & International 7.868057 4.255075 1.849 0.064916 .
## genreComedy -0.001377 2.336293 -0.001 0.999530
## genreDocumentary 10.238679 2.871883 3.565 0.000391 ***
## genreDrama 2.291462 2.027207 1.130 0.258761
## genreHorror -7.916760 3.415718 -2.318 0.020786 *
## genreMusical & Performing Arts 10.929612 4.485840 2.436 0.015110 *
## genreMystery & Suspense -3.607689 2.571890 -1.403 0.161192
## genreOther 1.588195 3.943830 0.403 0.687304
## genreScience Fiction & Fantasy -5.181804 5.232818 -0.990 0.322435
## runtime 0.060520 0.031616 1.914 0.056054 .
## dvd_rel_month 0.247860 0.163939 1.512 0.131066
## dvd_rel_day 0.058656 0.062433 0.940 0.347833
## critics_score 0.434203 0.022248 19.517 < 2e-16 ***
## best_pic_nomyes 9.270932 3.232463 2.868 0.004269 **
## best_actress_winyes -2.386234 1.826097 -1.307 0.191782
## top200_boxyes 4.132424 3.746723 1.103 0.270477
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.89 on 624 degrees of freedom
## (9 observations deleted due to missingness)
## Multiple R-squared: 0.5373, Adjusted R-squared: 0.5247
## F-statistic: 42.62 on 17 and 624 DF, p-value: < 2.2e-16
#dvd_rel_day is dropped
mlr_8<-lm(audience_score ~ genre + runtime + dvd_rel_month +
    critics_score + best_pic_nom + best_actress_win +
    top200_box, data=movies)
summary(mlr_8)
##
## Call:
## lm(formula = audience_score ~ genre + runtime + dvd_rel_month +
## critics_score + best_pic_nom + best_actress_win + top200_box,
## data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -35.680 -9.148 0.669 9.388 40.321
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 27.49348 3.92382 7.007 6.33e-12 ***
## genreAnimation 6.93264 5.00492 1.385 0.166497
## genreArt House & International 7.93435 4.25409 1.865 0.062634 .
## genreComedy 0.05742 2.33524 0.025 0.980392
## genreDocumentary 10.38257 2.86753 3.621 0.000318 ***
## genreDrama 2.45719 2.01933 1.217 0.224127
## genreHorror -7.90766 3.41538 -2.315 0.020919 *
## genreMusical & Performing Arts 11.27263 4.47054 2.522 0.011932 *
## genreMystery & Suspense -3.49772 2.56898 -1.362 0.173841
## genreOther 1.77881 3.93824 0.452 0.651658
## genreScience Fiction & Fantasy -4.95583 5.22680 -0.948 0.343415
## runtime 0.06042 0.03161 1.911 0.056442 .
## dvd_rel_month 0.24470 0.16389 1.493 0.135920
## critics_score 0.43281 0.02220 19.499 < 2e-16 ***
## best_pic_nomyes 9.41640 3.22845 2.917 0.003665 **
## best_actress_winyes -2.39461 1.82590 -1.311 0.190183
## top200_boxyes 4.12255 3.74636 1.100 0.271574
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.89 on 625 degrees of freedom
## (9 observations deleted due to missingness)
## Multiple R-squared: 0.5367, Adjusted R-squared: 0.5248
## F-statistic: 45.24 on 16 and 625 DF, p-value: < 2.2e-16
#top200_box is dropped
mlr_9<-lm(audience_score ~ genre + runtime + dvd_rel_month +
    critics_score + best_pic_nom + best_actress_win, data=movies)
summary(mlr_9)
##
## Call:
## lm(formula = audience_score ~ genre + runtime + dvd_rel_month +
## critics_score + best_pic_nom + best_actress_win, data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -36.180 -9.147 0.667 9.465 40.009
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 27.33967 3.92199 6.971 8.01e-12 ***
## genreAnimation 6.63035 4.99822 1.327 0.18514
## genreArt House & International 7.57011 4.24191 1.785 0.07481 .
## genreComedy -0.20717 2.32322 -0.089 0.92897
## genreDocumentary 9.95980 2.84215 3.504 0.00049 ***
## genreDrama 2.10992 1.99485 1.058 0.29061
## genreHorror -8.19921 3.40566 -2.408 0.01635 *
## genreMusical & Performing Arts 10.82121 4.45243 2.430 0.01536 *
## genreMystery & Suspense -3.83338 2.55124 -1.503 0.13346
## genreOther 1.59731 3.93545 0.406 0.68497
## genreScience Fiction & Fantasy -4.79809 5.22571 -0.918 0.35888
## runtime 0.06382 0.03147 2.028 0.04295 *
## dvd_rel_month 0.24656 0.16391 1.504 0.13302
## critics_score 0.43552 0.02206 19.739 < 2e-16 ***
## best_pic_nomyes 9.51517 3.22775 2.948 0.00332 **
## best_actress_winyes -2.27472 1.82296 -1.248 0.21256
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.89 on 626 degrees of freedom
## (9 observations deleted due to missingness)
## Multiple R-squared: 0.5358, Adjusted R-squared: 0.5246
## F-statistic: 48.16 on 15 and 626 DF, p-value: < 2.2e-16
#best_actress_win is dropped
mlr_10<-lm(audience_score ~ genre + runtime + dvd_rel_month +
    critics_score + best_pic_nom, data=movies)
summary(mlr_10)
##
## Call:
## lm(formula = audience_score ~ genre + runtime + dvd_rel_month +
## critics_score + best_pic_nom, data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -35.995 -8.985 0.618 9.515 40.085
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 27.90584 3.89738 7.160 2.26e-12 ***
## genreAnimation 6.28742 4.99288 1.259 0.208399
## genreArt House & International 7.38964 4.24132 1.742 0.081947 .
## genreComedy -0.47456 2.31434 -0.205 0.837600
## genreDocumentary 9.87441 2.84259 3.474 0.000549 ***
## genreDrama 1.82168 1.98231 0.919 0.358466
## genreHorror -8.26404 3.40678 -2.426 0.015557 *
## genreMusical & Performing Arts 10.87747 4.45418 2.442 0.014878 *
## genreMystery & Suspense -4.16473 2.53851 -1.641 0.101378
## genreOther 1.42751 3.93484 0.363 0.716887
## genreScience Fiction & Fantasy -4.81004 5.22802 -0.920 0.357900
## runtime 0.05841 0.03118 1.873 0.061497 .
## dvd_rel_month 0.24693 0.16398 1.506 0.132610
## critics_score 0.43541 0.02207 19.726 < 2e-16 ***
## best_pic_nomyes 8.92614 3.19446 2.794 0.005361 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.9 on 627 degrees of freedom
## (9 observations deleted due to missingness)
## Multiple R-squared: 0.5346, Adjusted R-squared: 0.5242
## F-statistic: 51.44 on 14 and 627 DF, p-value: < 2.2e-16
#dvd_rel_month is dropped
mlr_11<-lm(audience_score ~ genre + runtime + critics_score +
    best_pic_nom, data=movies)
summary(mlr_11)
##
## Call:
## lm(formula = audience_score ~ genre + runtime + critics_score +
## best_pic_nom, data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -35.636 -9.437 0.505 9.068 41.431
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.73777 3.67688 8.088 3.08e-15 ***
## genreAnimation 5.72179 4.97508 1.150 0.250539
## genreArt House & International 5.81573 4.10129 1.418 0.156673
## genreComedy -0.77027 2.28922 -0.336 0.736622
## genreDocumentary 9.77182 2.79904 3.491 0.000514 ***
## genreDrama 1.53573 1.95587 0.785 0.432634
## genreHorror -8.41533 3.39202 -2.481 0.013362 *
## genreMusical & Performing Arts 10.32291 4.43912 2.325 0.020362 *
## genreMystery & Suspense -4.42960 2.51990 -1.758 0.079254 .
## genreOther 1.07846 3.92206 0.275 0.783426
## genreScience Fiction & Fantasy -6.50856 4.94723 -1.316 0.188784
## runtime 0.05607 0.03100 1.809 0.070966 .
## critics_score 0.43993 0.02195 20.045 < 2e-16 ***
## best_pic_nomyes 8.77152 3.19268 2.747 0.006177 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.9 on 636 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.5378, Adjusted R-squared: 0.5283
## F-statistic: 56.92 on 13 and 636 DF, p-value: < 2.2e-16
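The output above reports 1 observation deleted due to missingness, whereas the earlier fits report 9, so this model is fitted on more rows than the models it is compared with. One way to keep the rows fixed across elimination steps is to subset to complete cases of the full model's variables first. The sketch below shows the idea on the built-in `airquality` data, used purely as a stand-in because it also contains missing values. It is not what was done in this report.

```r
# Built-in airquality data as a stand-in; Ozone and Solar.R contain NAs.
vars <- c("Temp", "Ozone", "Solar.R", "Wind")
complete_rows <- airquality[complete.cases(airquality[, vars]), ]

full_fit    <- lm(Temp ~ Ozone + Solar.R + Wind, data = airquality)
reduced_raw <- lm(Temp ~ Solar.R + Wind, data = airquality)     # row set changes
reduced_cc  <- lm(Temp ~ Solar.R + Wind, data = complete_rows)  # row set fixed

# Number of observations used by each fit
c(full = nobs(full_fit), reduced_raw = nobs(reduced_raw), reduced_cc = nobs(reduced_cc))
```

With the rows held fixed, changes in adjusted R-squared between steps reflect only the dropped predictor and not a change in the sample.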
#runtime is dropped
mlr_fin<-lm(audience_score ~ genre + critics_score + best_pic_nom, data=movies)
summary(mlr_fin)
##
## Call:
## lm(formula = audience_score ~ genre + critics_score + best_pic_nom,
## data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -35.912 -9.413 0.263 9.303 42.404
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.37484 1.94874 18.153 < 2e-16 ***
## genreAnimation 4.75334 4.95247 0.960 0.33752
## genreArt House & International 5.67764 4.10574 1.383 0.16719
## genreComedy -1.16710 2.28154 -0.512 0.60915
## genreDocumentary 9.00721 2.76845 3.254 0.00120 **
## genreDrama 1.76581 1.95409 0.904 0.36652
## genreHorror -9.08085 3.37626 -2.690 0.00734 **
## genreMusical & Performing Arts 10.72496 4.43899 2.416 0.01597 *
## genreMystery & Suspense -4.17514 2.51908 -1.657 0.09793 .
## genreOther 1.23054 3.92605 0.313 0.75406
## genreScience Fiction & Fantasy -6.70347 4.95228 -1.354 0.17634
## critics_score 0.44435 0.02184 20.342 < 2e-16 ***
## best_pic_nomyes 10.03913 3.11827 3.219 0.00135 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.91 on 638 degrees of freedom
## Multiple R-squared: 0.5353, Adjusted R-squared: 0.5266
## F-statistic: 61.25 on 12 and 638 DF, p-value: < 2.2e-16
After the above elimination, the model contains 3 predictor variables. One of them, genre, has 11 levels, so the summary shows 10 coefficients for it, each comparing one genre with the reference level. Three of these 10 coefficients have a p-value smaller than 0.05, so genre stays in the final model.
levels(movies$genre)
## [1] "Action & Adventure" "Animation"
## [3] "Art House & International" "Comedy"
## [5] "Documentary" "Drama"
## [7] "Horror" "Musical & Performing Arts"
## [9] "Mystery & Suspense" "Other"
## [11] "Science Fiction & Fantasy"
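Counting how many genre coefficients are individually significant is an informal rule. A common alternative is a single partial F-test that compares the model with and without the whole factor. The sketch below illustrates this on simulated data; the data frame `sim`, its four genre labels, and all numeric values are assumptions made for illustration and are not taken from the movies sample.

```r
# Simulated stand-in for the movies data: all values below are hypothetical
set.seed(1)
n <- 300
sim <- data.frame(
  genre = factor(sample(c("Action", "Comedy", "Drama", "Horror"), n, replace = TRUE)),
  critics_score = runif(n, 0, 100)
)
genre_shift <- c(Action = 0, Comedy = -1, Drama = 2, Horror = -9)
sim$audience_score <- 35 + 0.45 * sim$critics_score +
  unname(genre_shift[as.character(sim$genre)]) + rnorm(n, sd = 14)

fit_with    <- lm(audience_score ~ genre + critics_score, data = sim)
fit_without <- lm(audience_score ~ critics_score, data = sim)

# Partial F-test: one p-value for genre as a whole
genre_test <- anova(fit_without, fit_with)
print(genre_test)

# drop1() reports the same kind of F-test for every term in the model
term_tests <- drop1(fit_with, test = "F")
print(term_tests)
```

With the real data, the analogous call would compare `mlr_fin` against the same model refitted without genre. The F statistic for genre is the same in both tables, because both compare the same pair of nested models.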
Therefore the final multiple linear model to predict the audience_score of a movie consists of genre, critics_score, and best_pic_nom. The adjusted R squared is 0.5266.
The equation for this regression model could be written as:
audience_score = 35.37
  + 4.75 × genreAnimation
  + 5.68 × genreArt House & International
  - 1.17 × genreComedy
  + 9.01 × genreDocumentary
  + 1.77 × genreDrama
  - 9.08 × genreHorror
  + 10.72 × genreMusical & Performing Arts
  - 4.18 × genreMystery & Suspense
  + 1.23 × genreOther
  - 6.70 × genreScience Fiction & Fantasy
  + 0.44 × critics_score
  + 10.04 × best_pic_nomyes
The reference level for genre is Action & Adventure, and the reference level for best_pic_nom is no.
In the context of the data, the model predicts that, all else held constant, horror movies have the lowest audience score on average: 9.08 points below Action & Adventure movies, the reference level, and lower than every other genre. The slope for best_pic_nomyes is 10.04, meaning that, all else held constant, a movie nominated for the best picture Oscar is expected to score 10.04 points higher on average than one that was not. All else held constant, each additional point of critics_score is associated with an audience_score that is 0.44 points higher on average.
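To make the reference-level coding concrete, the following sketch computes a prediction by hand from the coefficients printed in the summary of mlr_fin. The movie itself (a horror film with a critics_score of 80 and a best picture nomination) is a hypothetical example, not a row from the data.

```r
# Coefficients copied from the summary(mlr_fin) output
b <- c(intercept = 35.37484, genreHorror = -9.08085,
       critics_score = 0.44435, best_pic_nomyes = 10.03913)

# Hypothetical movie: genre Horror, critics_score 80, nominated for best picture.
# Dummy variables are 1 for the movie's own level and 0 for every other level.
pred_horror <- b[["intercept"]] + b[["genreHorror"]] * 1 +
  b[["critics_score"]] * 80 + b[["best_pic_nomyes"]] * 1

# Same movie in the reference genre (Action & Adventure): no genre term at all
pred_action <- b[["intercept"]] +
  b[["critics_score"]] * 80 + b[["best_pic_nomyes"]] * 1

round(pred_horror, 2)                 # 71.88
round(pred_horror - pred_action, 2)   # -9.08, the genreHorror coefficient
```

With the fitted object available, `predict(mlr_fin, newdata = ...)` should give the same point prediction and can also return an interval via its `interval` argument.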
Now let’s fit the second model using imdb_rating as the response variable following a similar elimination process.
#full model
mlr2_full<-lm(imdb_rating ~ title_type + genre + runtime +
mpaa_rating + thtr_rel_month + thtr_rel_day +
dvd_rel_month + dvd_rel_day + critics_score +
imdb_num_votes + best_pic_nom + best_pic_win +
best_actor_win + best_actress_win + best_dir_win +
top200_box, data=movies)
summary(mlr2_full)
##
## Call:
## lm(formula = imdb_rating ~ title_type + genre + runtime + mpaa_rating +
## thtr_rel_month + thtr_rel_day + dvd_rel_month + dvd_rel_day +
## critics_score + imdb_num_votes + best_pic_nom + best_pic_win +
## best_actor_win + best_actress_win + best_dir_win + top200_box,
## data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.84167 -0.32527 0.05607 0.38325 1.82268
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.551e+00 3.435e-01 13.252 < 2e-16 ***
## title_typeFeature Film -1.184e-01 2.373e-01 -0.499 0.61809
## title_typeTV Movie -4.808e-01 3.721e-01 -1.292 0.19672
## genreAnimation -3.027e-01 2.504e-01 -1.209 0.22726
## genreArt House & International 6.578e-01 2.000e-01 3.290 0.00106 **
## genreComedy -1.029e-01 1.080e-01 -0.953 0.34112
## genreDocumentary 7.014e-01 2.535e-01 2.767 0.00583 **
## genreDrama 2.165e-01 9.544e-02 2.268 0.02367 *
## genreHorror -1.339e-01 1.592e-01 -0.841 0.40045
## genreMusical & Performing Arts 5.152e-01 2.183e-01 2.360 0.01859 *
## genreMystery & Suspense 1.465e-01 1.204e-01 1.217 0.22420
## genreOther 3.895e-03 1.825e-01 0.021 0.98297
## genreScience Fiction & Fantasy -2.816e-01 2.392e-01 -1.177 0.23971
## runtime 4.561e-03 1.584e-03 2.880 0.00411 **
## mpaa_ratingNC-17 -5.944e-01 4.840e-01 -1.228 0.21982
## mpaa_ratingPG -2.215e-01 1.808e-01 -1.225 0.22089
## mpaa_ratingPG-13 -3.039e-01 1.868e-01 -1.626 0.10437
## mpaa_ratingR -1.985e-01 1.803e-01 -1.101 0.27146
## mpaa_ratingUnrated -2.701e-01 2.077e-01 -1.300 0.19404
## thtr_rel_month 6.010e-03 7.533e-03 0.798 0.42532
## thtr_rel_day -9.969e-04 2.892e-03 -0.345 0.73039
## dvd_rel_month 1.319e-02 7.665e-03 1.721 0.08571 .
## dvd_rel_day 4.570e-03 2.859e-03 1.599 0.11043
## critics_score 2.318e-02 1.082e-03 21.414 < 2e-16 ***
## imdb_num_votes 2.032e-06 2.725e-07 7.456 3.07e-13 ***
## best_pic_nomyes 1.175e-01 1.676e-01 0.701 0.48357
## best_pic_winyes -3.360e-01 2.948e-01 -1.140 0.25481
## best_actor_winyes 1.400e-02 7.593e-02 0.184 0.85382
## best_actress_winyes -7.112e-03 8.364e-02 -0.085 0.93226
## best_dir_winyes 3.534e-02 1.098e-01 0.322 0.74757
## top200_boxyes -2.047e-01 1.781e-01 -1.150 0.25066
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6325 on 611 degrees of freedom
## (9 observations deleted due to missingness)
## Multiple R-squared: 0.6697, Adjusted R-squared: 0.6535
## F-statistic: 41.29 on 30 and 611 DF, p-value: < 2.2e-16
#drop best_actress_win
mlr2_1<-lm(imdb_rating ~ title_type + genre + runtime +
mpaa_rating + thtr_rel_month + thtr_rel_day +
dvd_rel_month + dvd_rel_day + critics_score +
imdb_num_votes + best_pic_nom + best_pic_win +
best_actor_win + best_dir_win + top200_box, data=movies)
summary(mlr2_1)
##
## Call:
## lm(formula = imdb_rating ~ title_type + genre + runtime + mpaa_rating +
## thtr_rel_month + thtr_rel_day + dvd_rel_month + dvd_rel_day +
## critics_score + imdb_num_votes + best_pic_nom + best_pic_win +
## best_actor_win + best_dir_win + top200_box, data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.84068 -0.32494 0.05393 0.38425 1.82344
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.553e+00 3.425e-01 13.293 < 2e-16 ***
## title_typeFeature Film -1.184e-01 2.371e-01 -0.499 0.61763
## title_typeTV Movie -4.817e-01 3.716e-01 -1.296 0.19542
## genreAnimation -3.038e-01 2.499e-01 -1.216 0.22460
## genreArt House & International 6.571e-01 1.996e-01 3.292 0.00105 **
## genreComedy -1.037e-01 1.074e-01 -0.966 0.33442
## genreDocumentary 7.009e-01 2.532e-01 2.768 0.00581 **
## genreDrama 2.155e-01 9.465e-02 2.277 0.02315 *
## genreHorror -1.343e-01 1.590e-01 -0.845 0.39854
## genreMusical & Performing Arts 5.151e-01 2.181e-01 2.362 0.01850 *
## genreMystery & Suspense 1.454e-01 1.196e-01 1.216 0.22460
## genreOther 3.305e-03 1.822e-01 0.018 0.98553
## genreScience Fiction & Fantasy -2.817e-01 2.391e-01 -1.178 0.23913
## runtime 4.547e-03 1.573e-03 2.890 0.00399 **
## mpaa_ratingNC-17 -5.934e-01 4.834e-01 -1.228 0.22009
## mpaa_ratingPG -2.216e-01 1.806e-01 -1.227 0.22024
## mpaa_ratingPG-13 -3.040e-01 1.867e-01 -1.629 0.10392
## mpaa_ratingR -1.984e-01 1.801e-01 -1.101 0.27130
## mpaa_ratingUnrated -2.699e-01 2.075e-01 -1.300 0.19396
## thtr_rel_month 6.016e-03 7.527e-03 0.799 0.42449
## thtr_rel_day -1.006e-03 2.887e-03 -0.348 0.72777
## dvd_rel_month 1.319e-02 7.659e-03 1.722 0.08556 .
## dvd_rel_day 4.571e-03 2.857e-03 1.600 0.11007
## critics_score 2.318e-02 1.081e-03 21.431 < 2e-16 ***
## imdb_num_votes 2.032e-06 2.723e-07 7.463 2.93e-13 ***
## best_pic_nomyes 1.161e-01 1.667e-01 0.697 0.48635
## best_pic_winyes -3.372e-01 2.942e-01 -1.146 0.25213
## best_actor_winyes 1.363e-02 7.575e-02 0.180 0.85722
## best_dir_winyes 3.532e-02 1.097e-01 0.322 0.74752
## top200_boxyes -2.055e-01 1.777e-01 -1.157 0.24786
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.632 on 612 degrees of freedom
## (9 observations deleted due to missingness)
## Multiple R-squared: 0.6697, Adjusted R-squared: 0.654
## F-statistic: 42.79 on 29 and 612 DF, p-value: < 2.2e-16
#drop best_actor_win
mlr2_2<-lm(imdb_rating ~ title_type + genre + runtime +
mpaa_rating + thtr_rel_month + thtr_rel_day +
dvd_rel_month + dvd_rel_day + critics_score +
imdb_num_votes + best_pic_nom + best_pic_win +
best_dir_win + top200_box, data=movies)
summary(mlr2_2)
##
## Call:
## lm(formula = imdb_rating ~ title_type + genre + runtime + mpaa_rating +
## thtr_rel_month + thtr_rel_day + dvd_rel_month + dvd_rel_day +
## critics_score + imdb_num_votes + best_pic_nom + best_pic_win +
## best_dir_win + top200_box, data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.84352 -0.32593 0.05214 0.38284 1.82108
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.549e+00 3.414e-01 13.325 < 2e-16 ***
## title_typeFeature Film -1.178e-01 2.369e-01 -0.497 0.61934
## title_typeTV Movie -4.824e-01 3.713e-01 -1.299 0.19437
## genreAnimation -3.028e-01 2.496e-01 -1.213 0.22555
## genreArt House & International 6.560e-01 1.994e-01 3.290 0.00106 **
## genreComedy -1.036e-01 1.073e-01 -0.966 0.33444
## genreDocumentary 7.015e-01 2.530e-01 2.773 0.00573 **
## genreDrama 2.160e-01 9.453e-02 2.285 0.02265 *
## genreHorror -1.350e-01 1.588e-01 -0.850 0.39570
## genreMusical & Performing Arts 5.144e-01 2.179e-01 2.361 0.01855 *
## genreMystery & Suspense 1.474e-01 1.190e-01 1.238 0.21612
## genreOther 3.568e-03 1.820e-01 0.020 0.98437
## genreScience Fiction & Fantasy -2.828e-01 2.388e-01 -1.185 0.23666
## runtime 4.600e-03 1.544e-03 2.980 0.00300 **
## mpaa_ratingNC-17 -5.875e-01 4.819e-01 -1.219 0.22329
## mpaa_ratingPG -2.206e-01 1.804e-01 -1.223 0.22179
## mpaa_ratingPG-13 -3.037e-01 1.865e-01 -1.628 0.10401
## mpaa_ratingR -1.981e-01 1.800e-01 -1.101 0.27148
## mpaa_ratingUnrated -2.701e-01 2.074e-01 -1.302 0.19331
## thtr_rel_month 6.029e-03 7.521e-03 0.802 0.42303
## thtr_rel_day -1.000e-03 2.885e-03 -0.347 0.72901
## dvd_rel_month 1.306e-02 7.619e-03 1.714 0.08701 .
## dvd_rel_day 4.566e-03 2.854e-03 1.600 0.11018
## critics_score 2.317e-02 1.081e-03 21.448 < 2e-16 ***
## imdb_num_votes 2.030e-06 2.719e-07 7.467 2.84e-13 ***
## best_pic_nomyes 1.197e-01 1.653e-01 0.724 0.46927
## best_pic_winyes -3.411e-01 2.931e-01 -1.164 0.24492
## best_dir_winyes 3.615e-02 1.095e-01 0.330 0.74137
## top200_boxyes -2.046e-01 1.775e-01 -1.153 0.24933
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6315 on 613 degrees of freedom
## (9 observations deleted due to missingness)
## Multiple R-squared: 0.6697, Adjusted R-squared: 0.6546
## F-statistic: 44.38 on 28 and 613 DF, p-value: < 2.2e-16
#drop best_director_win
mlr2_3<-lm(imdb_rating ~ title_type + genre + runtime +
mpaa_rating + thtr_rel_month + thtr_rel_day +
dvd_rel_month + dvd_rel_day + critics_score +
imdb_num_votes + best_pic_nom + best_pic_win +
top200_box, data=movies)
summary(mlr2_3)
##
## Call:
## lm(formula = imdb_rating ~ title_type + genre + runtime + mpaa_rating +
## thtr_rel_month + thtr_rel_day + dvd_rel_month + dvd_rel_day +
## critics_score + imdb_num_votes + best_pic_nom + best_pic_win +
## top200_box, data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.84495 -0.32706 0.05621 0.38222 1.82088
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.537e+00 3.393e-01 13.373 < 2e-16 ***
## title_typeFeature Film -1.148e-01 2.366e-01 -0.485 0.62757
## title_typeTV Movie -4.802e-01 3.710e-01 -1.294 0.19598
## genreAnimation -3.036e-01 2.494e-01 -1.217 0.22400
## genreArt House & International 6.539e-01 1.991e-01 3.284 0.00108 **
## genreComedy -1.041e-01 1.072e-01 -0.971 0.33184
## genreDocumentary 7.019e-01 2.528e-01 2.776 0.00567 **
## genreDrama 2.149e-01 9.440e-02 2.276 0.02318 *
## genreHorror -1.348e-01 1.587e-01 -0.850 0.39587
## genreMusical & Performing Arts 5.137e-01 2.177e-01 2.359 0.01863 *
## genreMystery & Suspense 1.472e-01 1.190e-01 1.237 0.21649
## genreOther 2.092e-03 1.818e-01 0.012 0.99082
## genreScience Fiction & Fantasy -2.810e-01 2.385e-01 -1.178 0.23932
## runtime 4.681e-03 1.523e-03 3.073 0.00221 **
## mpaa_ratingNC-17 -5.883e-01 4.816e-01 -1.222 0.22229
## mpaa_ratingPG -2.188e-01 1.802e-01 -1.214 0.22515
## mpaa_ratingPG-13 -3.025e-01 1.864e-01 -1.623 0.10508
## mpaa_ratingR -1.965e-01 1.798e-01 -1.093 0.27496
## mpaa_ratingUnrated -2.704e-01 2.072e-01 -1.305 0.19233
## thtr_rel_month 6.084e-03 7.513e-03 0.810 0.41836
## thtr_rel_day -1.028e-03 2.882e-03 -0.357 0.72146
## dvd_rel_month 1.293e-02 7.604e-03 1.700 0.08955 .
## dvd_rel_day 4.596e-03 2.851e-03 1.612 0.10741
## critics_score 2.322e-02 1.073e-03 21.644 < 2e-16 ***
## imdb_num_votes 2.032e-06 2.716e-07 7.481 2.57e-13 ***
## best_pic_nomyes 1.162e-01 1.649e-01 0.705 0.48134
## best_pic_winyes -3.148e-01 2.819e-01 -1.117 0.26444
## top200_boxyes -2.063e-01 1.773e-01 -1.164 0.24488
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.631 on 614 degrees of freedom
## (9 observations deleted due to missingness)
## Multiple R-squared: 0.6696, Adjusted R-squared: 0.6551
## F-statistic: 46.09 on 27 and 614 DF, p-value: < 2.2e-16
#drop theatre release day
mlr2_4<-lm(imdb_rating ~ title_type + genre + runtime +
mpaa_rating + thtr_rel_month + dvd_rel_month +
dvd_rel_day + critics_score + imdb_num_votes +
best_pic_nom + best_pic_win + top200_box, data=movies)
summary(mlr2_4)
##
## Call:
## lm(formula = imdb_rating ~ title_type + genre + runtime + mpaa_rating +
## thtr_rel_month + dvd_rel_month + dvd_rel_day + critics_score +
## imdb_num_votes + best_pic_nom + best_pic_win + top200_box,
## data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.84540 -0.33022 0.05701 0.38755 1.81011
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.525e+00 3.374e-01 13.412 < 2e-16 ***
## title_typeFeature Film -1.146e-01 2.364e-01 -0.485 0.62797
## title_typeTV Movie -4.831e-01 3.706e-01 -1.303 0.19293
## genreAnimation -3.047e-01 2.492e-01 -1.222 0.22203
## genreArt House & International 6.480e-01 1.983e-01 3.268 0.00114 **
## genreComedy -1.029e-01 1.071e-01 -0.961 0.33697
## genreDocumentary 7.011e-01 2.526e-01 2.775 0.00568 **
## genreDrama 2.140e-01 9.430e-02 2.270 0.02357 *
## genreHorror -1.370e-01 1.585e-01 -0.865 0.38763
## genreMusical & Performing Arts 5.123e-01 2.175e-01 2.355 0.01885 *
## genreMystery & Suspense 1.460e-01 1.188e-01 1.229 0.21968
## genreOther 5.142e-03 1.815e-01 0.028 0.97741
## genreScience Fiction & Fantasy -2.755e-01 2.379e-01 -1.158 0.24720
## runtime 4.698e-03 1.522e-03 3.087 0.00211 **
## mpaa_ratingNC-17 -5.881e-01 4.812e-01 -1.222 0.22213
## mpaa_ratingPG -2.206e-01 1.800e-01 -1.226 0.22069
## mpaa_ratingPG-13 -3.063e-01 1.859e-01 -1.648 0.09996 .
## mpaa_ratingR -1.973e-01 1.797e-01 -1.098 0.27255
## mpaa_ratingUnrated -2.701e-01 2.071e-01 -1.304 0.19264
## thtr_rel_month 5.761e-03 7.453e-03 0.773 0.43984
## dvd_rel_month 1.295e-02 7.598e-03 1.704 0.08883 .
## dvd_rel_day 4.596e-03 2.849e-03 1.613 0.10720
## critics_score 2.321e-02 1.072e-03 21.656 < 2e-16 ***
## imdb_num_votes 2.026e-06 2.710e-07 7.478 2.61e-13 ***
## best_pic_nomyes 1.180e-01 1.647e-01 0.717 0.47393
## best_pic_winyes -3.172e-01 2.816e-01 -1.126 0.26049
## top200_boxyes -2.054e-01 1.771e-01 -1.160 0.24660
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6306 on 615 degrees of freedom
## (9 observations deleted due to missingness)
## Multiple R-squared: 0.6695, Adjusted R-squared: 0.6556
## F-statistic: 47.92 on 26 and 615 DF, p-value: < 2.2e-16
#drop title_type
mlr2_5<-lm(imdb_rating ~ + genre + runtime + mpaa_rating +
thtr_rel_month + dvd_rel_month + dvd_rel_day +
critics_score + imdb_num_votes + best_pic_nom +
best_pic_win + top200_box, data=movies)
summary(mlr2_5)
##
## Call:
## lm(formula = imdb_rating ~ +genre + runtime + mpaa_rating + thtr_rel_month +
## dvd_rel_month + dvd_rel_day + critics_score + imdb_num_votes +
## best_pic_nom + best_pic_win + top200_box, data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.84241 -0.32963 0.05357 0.38917 1.82914
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.399e+00 2.393e-01 18.380 < 2e-16 ***
## genreAnimation -3.061e-01 2.492e-01 -1.228 0.21981
## genreArt House & International 6.533e-01 1.980e-01 3.299 0.00102 **
## genreComedy -1.002e-01 1.070e-01 -0.937 0.34934
## genreDocumentary 8.149e-01 1.471e-01 5.540 4.47e-08 ***
## genreDrama 2.077e-01 9.415e-02 2.206 0.02775 *
## genreHorror -1.332e-01 1.584e-01 -0.841 0.40080
## genreMusical & Performing Arts 5.487e-01 2.059e-01 2.665 0.00789 **
## genreMystery & Suspense 1.456e-01 1.188e-01 1.226 0.22083
## genreOther -2.039e-02 1.805e-01 -0.113 0.91008
## genreScience Fiction & Fantasy -2.761e-01 2.379e-01 -1.161 0.24610
## runtime 4.739e-03 1.521e-03 3.116 0.00192 **
## mpaa_ratingNC-17 -5.850e-01 4.812e-01 -1.216 0.22456
## mpaa_ratingPG -2.184e-01 1.799e-01 -1.214 0.22535
## mpaa_ratingPG-13 -3.047e-01 1.859e-01 -1.639 0.10170
## mpaa_ratingR -1.994e-01 1.796e-01 -1.110 0.26753
## mpaa_ratingUnrated -2.864e-01 2.057e-01 -1.393 0.16423
## thtr_rel_month 6.048e-03 7.449e-03 0.812 0.41716
## dvd_rel_month 1.265e-02 7.591e-03 1.666 0.09616 .
## dvd_rel_day 4.795e-03 2.844e-03 1.686 0.09230 .
## critics_score 2.330e-02 1.060e-03 21.975 < 2e-16 ***
## imdb_num_votes 2.033e-06 2.705e-07 7.514 2.02e-13 ***
## best_pic_nomyes 1.187e-01 1.646e-01 0.721 0.47121
## best_pic_winyes -3.238e-01 2.815e-01 -1.150 0.25057
## top200_boxyes -2.077e-01 1.771e-01 -1.173 0.24130
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6305 on 617 degrees of freedom
## (9 observations deleted due to missingness)
## Multiple R-squared: 0.6685, Adjusted R-squared: 0.6556
## F-statistic: 51.85 on 24 and 617 DF, p-value: < 2.2e-16
#drop best pic nom
mlr2_6<-lm(imdb_rating ~ + genre + runtime + mpaa_rating +
thtr_rel_month + dvd_rel_month + dvd_rel_day +
critics_score + imdb_num_votes + best_pic_win +
top200_box, data=movies)
summary(mlr2_6)
##
## Call:
## lm(formula = imdb_rating ~ +genre + runtime + mpaa_rating + thtr_rel_month +
## dvd_rel_month + dvd_rel_day + critics_score + imdb_num_votes +
## best_pic_win + top200_box, data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.84395 -0.32229 0.05043 0.39260 1.82894
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.377e+00 2.373e-01 18.447 < 2e-16 ***
## genreAnimation -3.051e-01 2.491e-01 -1.225 0.221137
## genreArt House & International 6.554e-01 1.979e-01 3.312 0.000982 ***
## genreComedy -9.845e-02 1.069e-01 -0.921 0.357361
## genreDocumentary 8.157e-01 1.470e-01 5.548 4.30e-08 ***
## genreDrama 2.113e-01 9.398e-02 2.249 0.024888 *
## genreHorror -1.296e-01 1.583e-01 -0.819 0.413278
## genreMusical & Performing Arts 5.463e-01 2.058e-01 2.655 0.008142 **
## genreMystery & Suspense 1.475e-01 1.187e-01 1.242 0.214684
## genreOther -8.534e-03 1.797e-01 -0.047 0.962132
## genreScience Fiction & Fantasy -2.767e-01 2.378e-01 -1.164 0.245027
## runtime 4.832e-03 1.515e-03 3.190 0.001493 **
## mpaa_ratingNC-17 -5.895e-01 4.809e-01 -1.226 0.220778
## mpaa_ratingPG -2.172e-01 1.799e-01 -1.208 0.227624
## mpaa_ratingPG-13 -3.031e-01 1.858e-01 -1.631 0.103340
## mpaa_ratingR -2.003e-01 1.796e-01 -1.116 0.265030
## mpaa_ratingUnrated -2.893e-01 2.055e-01 -1.408 0.159728
## thtr_rel_month 6.749e-03 7.383e-03 0.914 0.361000
## dvd_rel_month 1.282e-02 7.584e-03 1.690 0.091526 .
## dvd_rel_day 4.851e-03 2.842e-03 1.707 0.088285 .
## critics_score 2.339e-02 1.053e-03 22.207 < 2e-16 ***
## imdb_num_votes 2.058e-06 2.682e-07 7.672 6.62e-14 ***
## best_pic_winyes -2.417e-01 2.574e-01 -0.939 0.348040
## top200_boxyes -2.111e-01 1.770e-01 -1.193 0.233426
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6303 on 618 degrees of freedom
## (9 observations deleted due to missingness)
## Multiple R-squared: 0.6683, Adjusted R-squared: 0.6559
## F-statistic: 54.12 on 23 and 618 DF, p-value: < 2.2e-16
#drop theatre release month
mlr2_7<-lm(imdb_rating ~ + genre + runtime + mpaa_rating +
dvd_rel_month + dvd_rel_day + critics_score +
imdb_num_votes + best_pic_win + top200_box, data=movies)
summary(mlr2_7)
##
## Call:
## lm(formula = imdb_rating ~ +genre + runtime + mpaa_rating + dvd_rel_month +
## dvd_rel_day + critics_score + imdb_num_votes + best_pic_win +
## top200_box, data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.84531 -0.31255 0.05841 0.38831 1.84778
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.396e+00 2.363e-01 18.600 < 2e-16 ***
## genreAnimation -2.969e-01 2.489e-01 -1.193 0.233324
## genreArt House & International 6.589e-01 1.979e-01 3.330 0.000920 ***
## genreComedy -9.415e-02 1.068e-01 -0.882 0.378210
## genreDocumentary 8.191e-01 1.470e-01 5.574 3.73e-08 ***
## genreDrama 2.104e-01 9.396e-02 2.239 0.025499 *
## genreHorror -1.277e-01 1.582e-01 -0.807 0.419843
## genreMusical & Performing Arts 5.485e-01 2.057e-01 2.666 0.007875 **
## genreMystery & Suspense 1.421e-01 1.186e-01 1.199 0.231149
## genreOther -1.713e-02 1.794e-01 -0.095 0.923949
## genreScience Fiction & Fantasy -2.814e-01 2.377e-01 -1.184 0.236853
## runtime 5.132e-03 1.478e-03 3.472 0.000553 ***
## mpaa_ratingNC-17 -5.943e-01 4.808e-01 -1.236 0.216935
## mpaa_ratingPG -2.151e-01 1.798e-01 -1.196 0.232093
## mpaa_ratingPG-13 -3.069e-01 1.857e-01 -1.652 0.098946 .
## mpaa_ratingR -1.984e-01 1.795e-01 -1.105 0.269594
## mpaa_ratingUnrated -2.920e-01 2.055e-01 -1.421 0.155847
## dvd_rel_month 1.167e-02 7.480e-03 1.561 0.119083
## dvd_rel_day 4.930e-03 2.840e-03 1.736 0.083092 .
## critics_score 2.339e-02 1.053e-03 22.207 < 2e-16 ***
## imdb_num_votes 2.067e-06 2.680e-07 7.712 4.95e-14 ***
## best_pic_winyes -2.454e-01 2.573e-01 -0.954 0.340612
## top200_boxyes -2.027e-01 1.767e-01 -1.147 0.251855
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6302 on 619 degrees of freedom
## (9 observations deleted due to missingness)
## Multiple R-squared: 0.6678, Adjusted R-squared: 0.656
## F-statistic: 56.56 on 22 and 619 DF, p-value: < 2.2e-16
#drop best pic win
mlr2_8<-lm(imdb_rating ~ + genre + runtime + mpaa_rating +
dvd_rel_month + dvd_rel_day + critics_score +
imdb_num_votes + top200_box, data=movies)
summary(mlr2_8)
##
## Call:
## lm(formula = imdb_rating ~ +genre + runtime + mpaa_rating + dvd_rel_month +
## dvd_rel_day + critics_score + imdb_num_votes + top200_box,
## data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8405 -0.3151 0.0567 0.3918 1.8509
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.415e+00 2.354e-01 18.753 < 2e-16 ***
## genreAnimation -2.974e-01 2.489e-01 -1.195 0.232632
## genreArt House & International 6.536e-01 1.978e-01 3.305 0.001004 **
## genreComedy -1.006e-01 1.065e-01 -0.944 0.345531
## genreDocumentary 8.145e-01 1.469e-01 5.545 4.34e-08 ***
## genreDrama 2.069e-01 9.389e-02 2.204 0.027915 *
## genreHorror -1.334e-01 1.581e-01 -0.844 0.399273
## genreMusical & Performing Arts 5.491e-01 2.057e-01 2.669 0.007805 **
## genreMystery & Suspense 1.390e-01 1.185e-01 1.173 0.241328
## genreOther -1.091e-02 1.793e-01 -0.061 0.951495
## genreScience Fiction & Fantasy -2.803e-01 2.376e-01 -1.180 0.238617
## runtime 4.985e-03 1.470e-03 3.391 0.000741 ***
## mpaa_ratingNC-17 -5.931e-01 4.808e-01 -1.233 0.217867
## mpaa_ratingPG -2.169e-01 1.798e-01 -1.207 0.228042
## mpaa_ratingPG-13 -3.018e-01 1.856e-01 -1.625 0.104565
## mpaa_ratingR -1.959e-01 1.795e-01 -1.092 0.275413
## mpaa_ratingUnrated -2.895e-01 2.055e-01 -1.409 0.159285
## dvd_rel_month 1.218e-02 7.460e-03 1.633 0.103064
## dvd_rel_day 4.766e-03 2.835e-03 1.681 0.093175 .
## critics_score 2.335e-02 1.052e-03 22.188 < 2e-16 ***
## imdb_num_votes 1.999e-06 2.583e-07 7.739 4.10e-14 ***
## top200_boxyes -1.997e-01 1.767e-01 -1.131 0.258624
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6301 on 620 degrees of freedom
## (9 observations deleted due to missingness)
## Multiple R-squared: 0.6673, Adjusted R-squared: 0.656
## F-statistic: 59.22 on 21 and 620 DF, p-value: < 2.2e-16
#mpaa_ratingR has the next highest p-value, and the other levels of mpaa_rating also have relatively high p-values, so this variable is dropped next
mlr2_9<-lm(imdb_rating ~ + genre + runtime + dvd_rel_month +
dvd_rel_day + critics_score + imdb_num_votes +
top200_box, data=movies)
summary(mlr2_9)
##
## Call:
## lm(formula = imdb_rating ~ +genre + runtime + dvd_rel_month +
## dvd_rel_day + critics_score + imdb_num_votes + top200_box,
## data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.82006 -0.32699 0.03731 0.38647 1.79584
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.244e+00 1.811e-01 23.440 < 2e-16 ***
## genreAnimation -1.646e-01 2.268e-01 -0.726 0.46826
## genreArt House & International 6.303e-01 1.937e-01 3.253 0.00120 **
## genreComedy -1.256e-01 1.055e-01 -1.191 0.23417
## genreDocumentary 7.613e-01 1.321e-01 5.762 1.31e-08 ***
## genreDrama 1.932e-01 9.176e-02 2.106 0.03563 *
## genreHorror -1.307e-01 1.550e-01 -0.843 0.39973
## genreMusical & Performing Arts 5.346e-01 2.046e-01 2.613 0.00920 **
## genreMystery & Suspense 1.418e-01 1.160e-01 1.222 0.22213
## genreOther -2.228e-02 1.785e-01 -0.125 0.90068
## genreScience Fiction & Fantasy -2.600e-01 2.375e-01 -1.095 0.27404
## runtime 4.531e-03 1.451e-03 3.122 0.00188 **
## dvd_rel_month 1.181e-02 7.444e-03 1.587 0.11300
## dvd_rel_day 4.447e-03 2.830e-03 1.572 0.11657
## critics_score 2.361e-02 1.020e-03 23.151 < 2e-16 ***
## imdb_num_votes 1.984e-06 2.537e-07 7.821 2.25e-14 ***
## top200_boxyes -1.815e-01 1.742e-01 -1.042 0.29779
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6302 on 625 degrees of freedom
## (9 observations deleted due to missingness)
## Multiple R-squared: 0.6645, Adjusted R-squared: 0.6559
## F-statistic: 77.37 on 16 and 625 DF, p-value: < 2.2e-16
#drop top200_box
mlr2_10<-lm(imdb_rating ~ + genre + runtime + dvd_rel_month +
dvd_rel_day + critics_score + imdb_num_votes, data=movies)
summary(mlr2_10)
##
## Call:
## lm(formula = imdb_rating ~ +genre + runtime + dvd_rel_month +
## dvd_rel_day + critics_score + imdb_num_votes, data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.82086 -0.32935 0.03754 0.38655 1.79387
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.244e+00 1.811e-01 23.440 < 2e-16 ***
## genreAnimation -1.523e-01 2.265e-01 -0.672 0.501640
## genreArt House & International 6.410e-01 1.935e-01 3.313 0.000976 ***
## genreComedy -1.161e-01 1.051e-01 -1.105 0.269705
## genreDocumentary 7.736e-01 1.316e-01 5.878 6.77e-09 ***
## genreDrama 2.047e-01 9.111e-02 2.247 0.025010 *
## genreHorror -1.202e-01 1.547e-01 -0.777 0.437472
## genreMusical & Performing Arts 5.485e-01 2.042e-01 2.686 0.007414 **
## genreMystery & Suspense 1.543e-01 1.154e-01 1.337 0.181644
## genreOther -1.483e-02 1.784e-01 -0.083 0.933773
## genreScience Fiction & Fantasy -2.667e-01 2.374e-01 -1.123 0.261681
## runtime 4.465e-03 1.450e-03 3.079 0.002166 **
## dvd_rel_month 1.181e-02 7.444e-03 1.586 0.113222
## dvd_rel_day 4.448e-03 2.830e-03 1.572 0.116553
## critics_score 2.354e-02 1.018e-03 23.130 < 2e-16 ***
## imdb_num_votes 1.924e-06 2.470e-07 7.787 2.85e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6303 on 626 degrees of freedom
## (9 observations deleted due to missingness)
## Multiple R-squared: 0.6639, Adjusted R-squared: 0.6559
## F-statistic: 82.45 on 15 and 626 DF, p-value: < 2.2e-16
#drop dvd_release_day
mlr2_11<-lm(imdb_rating ~ + genre + runtime + dvd_rel_month +
critics_score + imdb_num_votes, data=movies)
summary(mlr2_11)
##
## Call:
## lm(formula = imdb_rating ~ +genre + runtime + dvd_rel_month +
## critics_score + imdb_num_votes, data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.86102 -0.33461 0.04813 0.38521 1.83572
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.307e+00 1.769e-01 24.350 < 2e-16 ***
## genreAnimation -1.450e-01 2.267e-01 -0.639 0.522797
## genreArt House & International 6.457e-01 1.937e-01 3.334 0.000906 ***
## genreComedy -1.114e-01 1.052e-01 -1.059 0.289857
## genreDocumentary 7.840e-01 1.316e-01 5.957 4.29e-09 ***
## genreDrama 2.173e-01 9.086e-02 2.392 0.017059 *
## genreHorror -1.193e-01 1.549e-01 -0.770 0.441476
## genreMusical & Performing Arts 5.737e-01 2.038e-01 2.815 0.005026 **
## genreMystery & Suspense 1.626e-01 1.154e-01 1.409 0.159313
## genreOther 5.151e-04 1.783e-01 0.003 0.997696
## genreScience Fiction & Fantasy -2.497e-01 2.374e-01 -1.052 0.293355
## runtime 4.482e-03 1.452e-03 3.088 0.002107 **
## dvd_rel_month 1.156e-02 7.452e-03 1.552 0.121244
## critics_score 2.345e-02 1.017e-03 23.051 < 2e-16 ***
## imdb_num_votes 1.922e-06 2.473e-07 7.770 3.21e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.631 on 627 degrees of freedom
## (9 observations deleted due to missingness)
## Multiple R-squared: 0.6626, Adjusted R-squared: 0.6551
## F-statistic: 87.96 on 14 and 627 DF, p-value: < 2.2e-16
#drop dvd_release_month
mlr2_fin<-lm(imdb_rating ~ + genre + runtime + critics_score +
imdb_num_votes, data=movies)
summary(mlr2_fin)
##
## Call:
## lm(formula = imdb_rating ~ +genre + runtime + critics_score +
## imdb_num_votes, data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.9094 -0.3305 0.0380 0.3873 1.8059
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.381e+00 1.673e-01 26.185 < 2e-16 ***
## genreAnimation -1.634e-01 2.272e-01 -0.719 0.47231
## genreArt House & International 5.454e-01 1.883e-01 2.896 0.00391 **
## genreComedy -1.161e-01 1.046e-01 -1.110 0.26758
## genreDocumentary 7.861e-01 1.304e-01 6.029 2.80e-09 ***
## genreDrama 2.118e-01 9.017e-02 2.349 0.01915 *
## genreHorror -1.172e-01 1.551e-01 -0.756 0.45006
## genreMusical & Performing Arts 5.548e-01 2.043e-01 2.716 0.00680 **
## genreMystery & Suspense 1.578e-01 1.152e-01 1.369 0.17137
## genreOther -1.039e-02 1.787e-01 -0.058 0.95369
## genreScience Fiction & Fantasy -4.185e-01 2.260e-01 -1.852 0.06446 .
## runtime 4.350e-03 1.451e-03 2.997 0.00283 **
## critics_score 2.374e-02 1.018e-03 23.315 < 2e-16 ***
## imdb_num_votes 1.938e-06 2.483e-07 7.804 2.48e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6347 on 636 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.6646, Adjusted R-squared: 0.6577
## F-statistic: 96.92 on 13 and 636 DF, p-value: < 2.2e-16
Now for imdb_rating, we have a multiple linear model with 4 explanatory variables: genre, runtime, critics_score, and imdb_num_votes. The variable genre has 11 levels, and 4 of its 10 non-reference coefficients have a p-value smaller than 0.05. Therefore genre stays in the model.
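The step-by-step elimination can also be written as a loop. The sketch below is one possible automation and not the exact procedure used in this report: it tests whole terms with the F-tests from `drop1()`, so a factor is kept or dropped as a unit, whereas the manual process looked at per-coefficient p-values. The function name `backward_by_pvalue`, the simulated data frame `sim`, and the columns `noise_a` and `noise_b` are hypothetical.

```r
# Simulated data: three real effects plus two pure-noise predictors (all hypothetical)
set.seed(42)
n <- 400
sim <- data.frame(
  genre = factor(sample(c("Action", "Comedy", "Documentary"), n, replace = TRUE)),
  runtime = rnorm(n, 105, 20),
  critics_score = runif(n, 0, 100),
  noise_a = rnorm(n),
  noise_b = rnorm(n)
)
sim$imdb_rating <- 4.4 + 0.8 * (sim$genre == "Documentary") +
  0.01 * sim$runtime + 0.024 * sim$critics_score + rnorm(n, sd = 0.6)

# Repeatedly drop the term with the largest p-value until all are <= alpha
backward_by_pvalue <- function(fit, alpha = 0.05) {
  repeat {
    tests <- drop1(fit, test = "F")
    tests <- tests[rownames(tests) != "<none>", , drop = FALSE]
    if (nrow(tests) == 0) break
    worst <- which.max(tests[["Pr(>F)"]])
    if (tests[["Pr(>F)"]][worst] <= alpha) break
    fit <- update(fit, as.formula(paste(". ~ . -", rownames(tests)[worst])))
  }
  fit
}

full  <- lm(imdb_rating ~ genre + runtime + critics_score + noise_a + noise_b,
            data = sim)
final <- backward_by_pvalue(full)
print(formula(final))
```

One caveat applies to the real data: when a dropped variable has missing values, the refitted model uses more rows, so consecutive models are not fitted to exactly the same sample. This is visible in the summaries above, where the number of deleted observations changes from 9 to 1 once dvd_rel_month is removed.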
The reference level for genre is Action & Adventure. The equation for the final regression model could be written as:
imdb_rating = 4.38
  - 0.16 × genreAnimation
  + 0.55 × genreArt House & International
  - 0.12 × genreComedy
  + 0.79 × genreDocumentary
  + 0.21 × genreDrama
  - 0.12 × genreHorror
  + 0.55 × genreMusical & Performing Arts
  + 0.16 × genreMystery & Suspense
  - 0.01 × genreOther
  - 0.42 × genreScience Fiction & Fantasy
  + 0.004 × runtime
  + 0.024 × critics_score
  + 1.938e-06 × imdb_num_votes
In the context of the data, this model predicts that, all else held constant, movies in the Art House & International or Musical & Performing Arts genres are expected to have an imdb rating about 0.55 higher on average than Action & Adventure movies, the reference level. Documentaries have the largest genre coefficient, 0.79. All else held constant, each additional point of critics_score is associated with an imdb rating that is 0.024 higher on average, and each additional vote in imdb_num_votes with a rating that is 1.938e-06 higher on average.
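A coefficient as small as 1.938e-06 is easier to read on a larger unit. The sketch below rescales the reported imdb_num_votes slope to "per 100,000 votes" and then checks, on simulated data, that dividing a predictor by a constant multiplies its slope by the same constant. The vectors `votes` and `rating` are hypothetical and only serve the check.

```r
# Slope for imdb_num_votes copied from the summary(mlr2_fin) output
coef_per_vote <- 1.938e-06
coef_per_100k <- coef_per_vote * 1e5
coef_per_100k   # 0.1938 rating points per additional 100,000 votes

# Simulated check that rescaling a predictor rescales its coefficient
set.seed(7)
votes  <- round(rexp(200, rate = 1 / 50000))
rating <- 6 + 2e-06 * votes + rnorm(200, sd = 0.5)

fit_raw  <- lm(rating ~ votes)
fit_100k <- lm(rating ~ I(votes / 1e5))

slope_raw  <- unname(coef(fit_raw)[2])
slope_100k <- unname(coef(fit_100k)[2])
c(per_vote = slope_raw, per_100k = slope_100k)
```

The fit itself (residuals, R squared, p-values) is unchanged by this rescaling; only the unit of the slope changes.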
Before we can use the models to predict audience scores or imdb_rating, the conditions for multiple linear regression models need to be checked.
Condition 1: Linear relationships between numerical explanatory variables and the response variable. Residual plots are used to check this condition.
The numeric explanatory variable to be plotted in the 1st model is critics_score. The numerical variables in the 2nd model are runtime, critics_score, and imdb_num_votes.
#checking residuals vs. critics_score in model 1
#(predictor values are taken from the model frame stored in the fitted object, so they line up with the residuals)
plot(mlr_fin$model$critics_score, mlr_fin$residuals)
#checking residuals vs. runtime in model 2
plot(mlr2_fin$model$runtime, mlr2_fin$residuals)
#checking residuals vs. critics_score in model 2
plot(mlr2_fin$model$critics_score, mlr2_fin$residuals)
#checking residuals vs. imdb_num_votes in model 2
plot(mlr2_fin$model$imdb_num_votes, mlr2_fin$residuals)
The above plots show residuals randomly scattered around zero. Therefore the first condition for multiple linear regression is met.
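One practical detail when plotting residuals against a predictor: rows with missing values are dropped when the model is fitted (the summary output reports 1 observation deleted due to missingness), so the residual vector can be shorter than the original data. A minimal sketch of how the model frame stored in a fitted lm object keeps predictor and residuals aligned, using the built-in airquality data as a stand-in because the movies data is not reproduced here:

```r
# Stand-in example: airquality ships with R and contains missing values
fit <- lm(Ozone ~ Solar.R + Wind, data = airquality)

# Rows with missing values are dropped, so there are fewer residuals than rows
n_rows      <- nrow(airquality)
n_residuals <- length(residuals(fit))

# The model frame stored in the fitted object keeps only the rows actually used
model_frame <- fit$model

# Predictor and residuals now have the same length and the same row order
plot(model_frame$Wind, residuals(fit),
     xlab = "Wind", ylab = "Residuals")
abline(h = 0, lty = 2)  # reference line at zero
```

The same idea would apply to the movie models, assuming they were fitted with lm() and its default setting that stores the model frame.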
Condition 2: nearly normal residuals. Histograms and normal probability plots of the residuals are used.
#check residuals in model 1
hist(mlr_fin$residuals)
#using normal probability plot to check the distribution of the residuals in model 1
qqnorm(mlr_fin$residuals)
qqline(mlr_fin$residuals)
As shown in the plots above, the residuals in model 1 follow a nearly normal distribution. Now let’s check model 2.
hist(mlr2_fin$residuals)
qqnorm(mlr2_fin$residuals)
qqline(mlr2_fin$residuals)
As shown above, the distribution of the residuals in model 2 is not exactly normal.
Condition 3: constant variability of residuals. Plots of residuals vs. predicted values are used.
#residuals plot for model 1
plot(mlr_fin$residuals ~ mlr_fin$fitted.values)
plot(abs(mlr_fin$residuals) ~ mlr_fin$fitted.values)
#residuals plot for model 2
plot(mlr2_fin$residuals ~ mlr2_fin$fitted.values)
plot(abs(mlr2_fin$residuals) ~ mlr2_fin$fitted.values)
The above plots show a constant variability of residuals in model 1. In model 2, there seems to be slightly larger variability for predicted ratings below 6 than for those above 6.
Condition 4: independent residuals. The residuals are plotted against their order in the data set.
#residuals in model 1
plot(mlr_fin$residuals)
#residuals in model 2
plot(mlr2_fin$residuals)
The plots above show that there is no pattern to the residuals in either model. It is therefore reasonable to consider the residuals independent from one another in both models.
Overall, based on the diagnostic plots, model 1 can be considered a reliable multiple linear regression model for predicting audience_score. Model 2 can still be useful but is less reliable for predicting the imdb_rating of movies. Model 2 could probably be improved if the outlier observations in imdb_rating were removed or treated separately.
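Base R can also draw several of these diagnostic plots directly from a fitted lm object. A minimal sketch, using the built-in mtcars data and an illustrative formula as stand-ins because the movies data and the fitted models are not reproduced here:

```r
# Stand-in model on built-in data; the formula is illustrative only
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Arrange three diagnostic panels side by side
old_par <- par(mfrow = c(1, 3))

# which = 1: residuals vs. fitted values (linearity, constant variability)
# which = 2: normal Q-Q plot of standardized residuals (nearly normal residuals)
# which = 3: scale-location plot (constant variability)
plot(fit, which = 1:3)

# Restore the previous graphics settings
par(old_par)
```

Calling plot() on the movie models in the same way would be one compact alternative to building each plot by hand, assuming they are ordinary lm fits.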
Let’s use the two models to predict the audience_score and imdb_rating of a 2016 movie: The Girl on the Train.
The movie has 165,696 votes on imdb. Its runtime is 112 minutes and its genre is Mystery & Suspense. Its critics score on Rotten Tomatoes is 44, and the movie hasn’t been nominated for the Best Picture Oscar. The information comes from https://www.rottentomatoes.com/m/the_girl_on_the_train_2016 and https://www.imdb.com/title/tt3631112/
#create new data.frame containing the information about the movie
newmovie <- data.frame(genre="Mystery & Suspense", runtime=112, imdb_num_votes=165696, critics_score=44, best_pic_nom="no")
newmovie
##                genre runtime imdb_num_votes critics_score best_pic_nom
## 1 Mystery & Suspense     112         165696            44           no
#predict the audience_score using model 1 mlr_fin
predict(mlr_fin, newmovie)
##        1
## 50.75111
#quantifying uncertainty around prediction using prediction interval
predict(mlr_fin, newmovie, interval="prediction", level=0.95)
##        fit      lwr      upr
## 1 50.75111 23.19339 78.30884
The audience score on Rotten Tomatoes is 49, which is close to the score predicted by the model, 50.75. Based on the prediction interval, we are 95% confident that a movie in the Mystery & Suspense genre with a critics score of 44 and no Best Picture Oscar nomination has an audience score between 23 and 78.
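The interval is wide because a prediction interval covers a single new observation, not just the average response. A minimal sketch of the difference between the two interval types in predict(), using the built-in mtcars data and an illustrative model as stand-ins (the new observation is hypothetical):

```r
# Stand-in model on built-in data; not the movie models from this report
fit <- lm(mpg ~ wt + hp, data = mtcars)

# A hypothetical new observation
new_car <- data.frame(wt = 3, hp = 120)

# Confidence interval: uncertainty about the average response at these values
conf_int <- predict(fit, new_car, interval = "confidence", level = 0.95)

# Prediction interval: uncertainty about one individual new response,
# which adds the residual variability on top of the uncertainty in the mean
pred_int <- predict(fit, new_car, interval = "prediction", level = 0.95)

# Compare the widths of the two intervals
width_conf <- unname(conf_int[, "upr"] - conf_int[, "lwr"])
width_pred <- unname(pred_int[, "upr"] - pred_int[, "lwr"])
c(confidence = width_conf, prediction = width_pred)
```

Both intervals are centered on the same fitted value; the prediction interval is the wider of the two, which is why it is the appropriate choice when forecasting the score of one particular movie.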
Now let’s use model 2 to predict the imdb_rating of this movie.
predict(mlr2_fin, newmovie)
##        1
## 6.392136
#quantifying the uncertainty around the prediction using prediction interval.
predict(mlr2_fin, newmovie, interval="prediction", level=0.95)
##        fit      lwr      upr
## 1 6.392136 5.134217 7.650054
The imdb_rating for the movie on the imdb site is 6.5, close to the predicted rating of 6.39. Based on the prediction interval, we are 95% confident that a movie in the Mystery & Suspense genre with a runtime of 112 minutes, a critics score of 44, and 165,696 votes has an imdb rating between 5.1 and 7.7.
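The point prediction can be checked by hand from the coefficients in the model 2 summary. This is a minimal sketch: the slopes are copied from the summary output, while the intercept of 4.38 is the rounded value from the regression equation, so the result is only approximate.

```r
# Coefficients from the model 2 summary output; the intercept (4.38) is the
# rounded value from the regression equation, so the result is approximate
intercept        <- 4.38
b_mystery        <- 1.578e-01   # genreMystery & Suspense
b_runtime        <- 4.350e-03
b_critics_score  <- 2.374e-02
b_imdb_num_votes <- 1.938e-06

# Values for the new movie
runtime        <- 112
critics_score  <- 44
imdb_num_votes <- 165696

# Linear predictor: intercept + genre offset + sum of slope * value
by_hand <- intercept + b_mystery +
  b_runtime * runtime +
  b_critics_score * critics_score +
  b_imdb_num_votes * imdb_num_votes

round(by_hand, 2)
```

The hand calculation gives about 6.39, in line with the 6.392136 returned by predict(); the small gap comes from the rounded intercept.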
To conclude, it’s interesting to see that the popularity of movies as shown on two websites, imdb and Rotten Tomatoes, differs slightly in terms of its significant predictors. While genre seems to be a significant factor in movie popularity for both websites, the specific genres that tend to be more popular than others differ. If popularity is the goal, the models suggest that it might be safer to produce or purchase Art House movies or musicals, while horror, based on the first model, may not be the best bet.
Whether a movie has been nominated for the Best Picture Oscar is shown in model 1 to be a significant predictor of popularity with audiences. This is plausible, since films that earn a nomination often appeal to general audiences as well, even though the nominations themselves are decided by Academy members rather than by the audience. However, whether the movie has won the Oscar doesn’t seem to be a significant predictor. It’s also interesting to see that runtime is a significant predictor for ratings on IMDB.
Finally, it should be noted that the p-value approach was used for model selection in this report. Therefore, the models may not be the ones with the best prediction accuracy. This is reflected in the R-squared and adjusted R-squared of the models, which are just above 50% for model 1 and just above 60% for model 2. If prediction accuracy is the goal, the adjusted R-squared approach could be used instead.
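As a rough illustration of the adjusted R-squared approach, the sketch below performs a single backward-elimination step: it drops each term in turn and checks whether any reduced model has a higher adjusted R-squared than the full model. It uses the built-in mtcars data and an arbitrary starting formula as stand-ins, and the helper adj_r2 is a hypothetical name introduced here, not part of the original analysis.

```r
# Hypothetical helper: extract the adjusted R-squared of a fitted lm
adj_r2 <- function(fit) summary(fit)$adj.r.squared

# Stand-in full model on built-in data; the formula is illustrative only
full <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
terms_full <- attr(terms(full), "term.labels")

# Refit the model once per term, each time leaving that term out,
# and record the adjusted R-squared of the reduced model
candidates <- sapply(terms_full, function(term) {
  reduced <- update(full, as.formula(paste(". ~ . -", term)))
  adj_r2(reduced)
})

# The term whose removal yields the highest adjusted R-squared
best_drop <- names(which.max(candidates))

# Drop it only if that beats the full model; otherwise keep the full model
improves <- max(candidates) > adj_r2(full)

list(full = adj_r2(full), reduced = candidates,
     best_drop = best_drop, improves = improves)
```

In a complete selection, this step would be repeated on the reduced model until no removal increases the adjusted R-squared.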